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Abstract 

^ We introduce efficient Markov chain Monte Carlo methods for inference and model de- 

1^ termination in multivariate and matrix-variate Gaussian graphical models. Our framework is 

j—i based on the G-Wishart prior for the precision matrix associated with graphs that can be de- 

CN composable or non-decomposable. We extend our sampling algorithms to a novel class of 

,__, conditionally autoregressive models for sparse estimation in multivariate lattice data, with a 

lJ4 special emphasis on the analysis of spatial data. These models embed a great deal of flexi- 

^ bility in estimating both the correlation structure across outcomes and the spatial correlation 

structure, thereby allowing for adaptive smoothing and spatial autocorrelation parameters. Our 
methods are illustrated using simulated and real-world examples, including an application to 
, ^, cancer mortality surveillance. 

Keywords: CAR model; Gaussian graphical model; G-Wishart distribution; Lattice data; 
^ Markov chain Monte Carlo (MCMC) simulation; Spatial statistics. 

On 

^ 1 Introduction 



a 



Q Graphical models (LaurHzenj 1996 1, which encode the conditional independence among variables 



O using an undirected graph, have become a popular tool for sparse estimation in both the statis- 

tics and machine learning literatures. Implementing model selection approaches in the context of 
. ^ graphical models typically allows a dramatic reduction in the number of parameters under con- 

^ sideration, preventing overfitting and improving predictive capability. In particular, Bayesian ap- 

5^ proaches to inference in graphical models generate regularized estimators that incorporate uncer- 

tainty of the model structure. 



The focus of this paper is Bayesian inference in Gaussian graphical models (Dempsten 1972) 



based on the G-Wishart prior ( [Roverato] |2002[ |Atay-Kayis & Massam[ |2005^ |Letac & Massam 
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2007]). This class of distributions is extremely attractive since it represents the conjugate family for 



the precision matrix whose elements associated with edges in the underlying graph are constrained 
to be equal to zero. In addition, the G-Wishart naturally accommodates decomposable and non- 
decomposable graphs in a coherent manner. 

Many recent papers have described various stochastic search methods in Gaussian graphical 
models (GGMs) with the G-Wishart prior based on marginal likelihoods which, in this case, are 
given by the ratio of the normalizing constants of the posterior and prior G-Wishart distributions 



- see 'Atay-Kayis & Massam (2005); Jones et al. (2005); Carvalho & Scott (2009); Lenkoski & 



Dobra, ( ,2010 ) and the references therein. Although the computation of marginal likelihoods for 
decomposable graphs is straightforward, similar computations for non-decomposable graphs raise 
significant numerical challenges that are a consequence of having to approximate the normalizing 



constants of the corresponding G-Wishart posterior distributions. In that regard, [Lenkoski & Dobra 
(2010') point out that the Monte Carlo method of Atay-Kayis & Massam] ( 2005| ) is fast and accurate 
when employed for the computations of the normalizing constant of G-Wishart priors, although it 
is slow to converge for computations that involve G-Wishart posteriors. This remark leads to the 
idea of devising a Bayesian model determination method that avoids the computation of posterior 
normalizing constants. 

The contributions of this paper are twofold. First, we devise a new reversible jump sampler 
( ]Green] [1995 ) for model determination and estimation in multivariate and matrix-variate GGMs 
that avoids the calculation of posterior marginal likelihoods. The algorithm relies on a novel 
Metropolis-Hastings algorithm for sampling from the G-Wishart distribution associated with an 



arbitrary graph. Related methods are the direct sampling algorithm of Carvalho et al. (2007) and 



the block Gibbs sampler algorithm of Piccioni ( ]2000 ) and ]Asci & Piccioiii ( 2007| ). However, 
known applications of the former method involve exclusively decomposable graphs, while the lat- 
ter method relies on the identification of all the cliques of the graph, a problem that is extremely 
difficult in the case of graphs with many vertices or edges ( Tomita et al.] 2006] ). Our method does 
not require the determination of the cliques and works well for any graph as it is based on the 
Cholesky decompositions of the precision matrix developed in Roverato ( ]2000] 2002[ ); Atay-Kayis 



& Massam] ( 12005| ). Second, we devise a new flexible class of conditionally autoregressive models 
(CAR) ( Besag]]T974 ) for lattice data that rely on our novel sampling algorithms. The link between 



conditionally autoregressive models and GGMs was originally pointed out by |Besag & Kooperberg 
(JT995). However, since typical neighborhood graphs are non-decomposable, fully exploiting this 
connection within a Bayesian framework requires that we are able to estimate Gaussian graphical 
based on general graphs. Our main focus is on applications to multivariate lattice data, where our 
approach based on matrix-variate GGMs provides a natural approach to create sparse multivariate 
CAR models. 

The organization of the paper is as follows. Section |2] formally introduces GGMs and the 
G-Wishart distribution, along with a description of our novel sampling algorithm in the case of 
multivariate GGMs. This algorithm represents a generalization of the work of 'Giudici & Green 
([1999) and is applicable beyond decomposable GGMs. Section [3] discusses inference and model 



determination in matrix-variate GGMs. Unlike the related framework developed in Wang & West 



(2009) (which requires computation of marginal likelihoods and involves exclusively decompos- 



able graphs), our sampler operates on the joint distribution of the row and column precision ma- 
trices, the row and column conditional independence graphs and the auxiliary variable that needs 
to be introduced to solve the underlying non-identifiability problem associated with matrix-variate 
normal distributions. Section]?] reviews conditional autoregressive priors for lattice data and dis- 
cuss their connection to GGMs. This section discusses both univariate and multivariate models for 
continuous and discrete data based on generalized linear models. Section]5]presents three illustra- 
tions of our methodology: a simulation study, a spatial linear regression for modeling statewide 
SAT scores, and a cancer mortality mapping model. Finally, Section ]6] concludes the paper by 
discussing some of the limitations of the approach and future directions for our work. 



2 Gaussian graphical models and the G-Wishart distribution 

We let X = Xy,, Vp = {1, 2, ... ,p}, be a random vector with a p-dimensional multivariate normal 
distribution Np(0, K'^'). We consider a graph G = {Vp,E), where each vertex i e V corresponds 
with a random variable X, and E <z V x V are undirected edges. Here "undirected" means that 
{i, j) e Elf and only if (j, i) e E. We denote by Qp the set of all p{p - l)/2 undirected graphs with 
p vertices. A Gaussian graphical model with conditional independence graph G is constructed by 
constraining to zero the off-diagonal elements of K that do not correspond with edges in G ( |Demp 



ster 



1972). If (?', i) i E, Xi and Xj are conditional independent given the remaining variables. The 
\Kij\ . is constrained to the cone Pc of symmetric positive definite ma- 
for all (/, j) t E. 



precision matrix K 

trices with off-diagonal entries K^j 

We consider the G-Wishart distribution WISgC^, D) with density 



;?(K|G,5,D) = 



1 



/g(5,D) 



(det K) 



(<5-2)/2 



exp 



1 



:(K,D) 



(1) 



with respect to the Lebesgue measure on Pg ( [Roverato] |2002[ [Atay-Kayis & Massam]|2005[rLetac 
& Massam, 2007). Here (A,B) = tr(A^B) denotes the trace inner product. The normalizing 



constant /g(5, D) is finite provided 5>2 and D positive definite ( Diaconnis & Ylvisaker[]T979 ). 
We write K e Pg as 



K = Q'^ (^^t) Q, 



(2) 



whereQ = (2,jj . . andT = (*P,yj . are upper triangular, while D ' = Q'^Q is the Cholesky 

decomposition of D~^ We see that K = O^O with O = *PQ is the Cholesky decomposition of K. 
The zero constraints on the off-diagonal elements of K associated with G induce well-defined sets 
of free elements O''^^^ = {0,y : (z, j) e v(G)} and "V"^^^ = {¥,-y : (/, j) € v(G)} of the matrices 0* and 
T ( IRoveratoj ]2002| |Atay-Kayis & Massam|]2005] ), where 



v(G) = {(/, /) •.ieVp}U {(/, j) : / < J and (/, j) e E}. 

We denote by M^^^^ the set of incomplete triangular matrices whose elements are indexed by v(G) 
and whose diagonal elements are strictly positive. Both (P^^^> and *P''('^) must belong to M''^^\ The 



non-free elements of T are determined through the completion operation from Lemma 2 of |Atay 
Kayis & Massam (2005 1 as a function of the free elements T''^*^\ Each element *P,y with i < j and 
(/, j) ^ £" is a function of the other elements ^r^ that precede it in lexicographical order 



^v(G) ig 



Roverato (2002) proves that the Jacobean of the transformation that maps K e Pg to O"^*^^ 6 

p 



/(k^€»"(^)) = 2"]~[o;; 



v'^+l 



i=l 



where vf is the number of elements in the set {j : j > i and (i,j) e E}. Furthermore, Atay 



Kayis & Massam (2005 1 show that the Jacobean of the transformation that maps O"*^*^^ 6 Af'^^ to 

rpy{G) g ^viG) ig gi^gj^ by 



7(0^(G)^,pv(G)J = ]-[<2j+\ 



i=l 



where df is the number of elements in the set {j : j < i and (i, j) e E}. We have det K = H ^^ ^iid 

(=1 

^ii = ^uQii- It follows that the density of T""^*^^ with respect to the Lebesgue measure on M'^'-^^ is 
p (t"(^V, D) = piK\G, 6, D)7 (K ^ 0"(^)) J (o^^^^ ^ T^^^^) , 



'" flef^'fl •?;;"-' expU 2''' 



/g(*,D) 



(=1 



1=1 



U=i 



We note that vf + df represents the number of neighbors of vertex i in the graph G. 



2.1 Sampling from the G-Wishart distribution 

We introduce a Metropolis-Hastings algorithm for sampling from the G-Wishart distribution ([T|). 
We consider a strictly positive precision parameter cr^. We denote by K.^'^^ = Q^ r¥^^A T^^^Q the 

current state of the chain with (tI'I)"^^^ 6 M^^^K The next state Ri'+i] = Q^^Ti-^+iij^^Ti-^+iiQ is 
obtained by sequentially perturbing the free elements r¥^''A . A diagonal element *F|^? > is 
updated by sampling a value y from a N (w\^] , cr^A distribution truncated below at zero. We define 
the upper triangular matrix T' such that *P^ . = *P|*^ for {i, j) e v{G) \ {(/q, iq)] and ¥^^^.^ = y. The 
non-free elements of T' are obtained through the completion operation from (T')'^*^'^^ The Markov 
chain moves to K' = Q^ (^F')^ *P'Q with probability min {R,„, 1}, where 



R„ 



p[m'''^\5,ii)p[^^,\x.,,) ^KiMn) 



'o'o 






Here 0(-) represents the CDF of the standard normal distribution and 



'0 



R' 



(3) 



^^=-p{4Ef(nr-(^^o^ 



(4) 



A free off-diagonal element *F|'^. is updated by sampling a value 7' ~ N r¥\^\ , cr^A. We define 
the upper triangular matrix *P' such that ¥^ = *F|^^ for {i,j) e v(G) \ {(/o,7o)} and W.^j^ = y'. 
The remaining elements of *P' are determined by completion from (*P')'^^'^^ The proposal distri- 
bution is symmetric p (^^LqI^/*/ ) = P y^]"] l^i'o/o)' ^^^^ ^^ accept the transition of the chain from 
Rf"] to K' = Q^ (Y')^ Y'Q with probability min{i?;,„ 1}, where R'^ is given in equation ^. Since 
^^[.v]J^^^* 6 M''^^\ we have (Y')"^^^ 6 M^(^) which implies K' e Pq. 

We denote by K^'^'^ the precision matrix obtained after completing all the Metropolis-Hastings 
updates associated with the free elements indexed by v(G). The acceptance probabilities depend 
only on the free elements of the upper diagonal matrices T^*^ and *P'. An efficient implementation 
of our sampling procedure calculates only the final matrix K^^^'^ and avoids determining the inter- 
mediate candidate precision matrices K'. There is another key computational aspect that relates to 
the dependence of the Cholesky decomposition on a particular ordering of the variables. Consider 
two free elements W\^. and ¥[fl such that (io, jo) < (/[,, j'q) in lexicographical order. The comple- 



tion operation from Lemma 2 of Atay-Kayis & Massam ( 2005 ) shows that perturbing the value 
of W\^. leads to a change in the value of all non-free elements |^['^ : (z, j) ^ E and (/q, Jo) < (U j) 



On the other hand, perturbing the value of *pI*1 implies a change in the smaller set of elements 

'oh 

J*F|^^ : (?, j) i E and (Zq, Jq) < {i,j)\- This means that the chain is less likely to move when per- 
forming the update associated with *P|*| than when updating ¥^^1 . As such, the ordering of the 

'OiO IqJq 

variables is modified at each iteration of the MCMC sampler. More specifically, a permutation v is 
uniformly drawn from the set of all possible permutations Tp of V. The row and columns of D are 
reordered according to v and a new Cholesky decomposition of D~^ is determined. The set v{G) 
and [df : i eV) are recalculated given the ordering of the vertices induced by v, i.e. i <^ j if and 
only if v{i) < v(j). Changing the ordering guarantees that across iterations it is equally likely that 
(h), Jo) precedes or succeeds (/g, j,,). 

Our later developments from Section |3] involve sampling from the G-Wishart distribution ([!]) 
subject to the constraint ^n = 1. We have 

We subsequently obtain the next state K^-''^^^ of the Markov chain by perturbing the free elements 

/ r ,\V(G)\|(1,1)) 

/vpLM 1 When defining the triangular matrix T' we set ^[^ = \/Qn which implies that the 

corresponding candidate matrix K' has K[^^ = 1. Thus K.^"*^^ also obeys the constraint ^["i^^^ = 1. 
The random orderings of the variables need to be drawn from the set T), of permutations v e Tp 
such that 1/(1) = 1. This way the (1,1) element of K always occupies the same position. 

2.2 Bayesian inference in GGMs 

We let D = lx'-^\ . . . , x*"M be the observed data of n independent samples from Np(0, K~'). The 
likelihood function is 

p (D|K) = ilnyP'Hdct Kf^ exp |-^<K, U)| , (5) 



where U = 2 x*^\x^^^)^ is the observed sum-of-products matrix. Given a graph G, we assume 

a G-Wishart prior WiScC^o, Do) for the precision matrix K. We take 60 = 3 > 2 and Dq = Ip, 
the jc-dimensional identity matrix. This choice implies that prior for K is equivalent with one 
observed sample, while the observed variables are assumed to be apriori independent of each other. 
Since the G-Wishart prior for K is conjugate for the likelihood ([5]), the posterior of K given G is 
WiSoC" + 6o,lJ + Do). We also assume a uniform prior p(G) = 2/p{p - 1) on Q^. 
We develop a MCMC algorithm for sampling from the joint distribution 

p {D, K, G\6o, Do) ex p (£)|K) p (K|G, 60, Do) p(G), 

that is well-defined if and only if K 6 Pq. We denote the current state of the chain by (K^'^, G'^^^), 
K^'^ e PgI'I- Its next state (Rt'^'^^^G^^^^^), K^"'^^^ e PgI'+Hj is generated by sequentially performing 
the following two steps. We make use of two strictly positive precision parameters cr,„ and cTg that 
remain fixed at some suitable small values. We assume that the ordering of the variables has been 
changed according to a permutation v selected at random from the uniform distribution on Y^. We 
denote by (U + Do)"^ = (Q*)^Q* the Cholesky decomposition of (U + Do)"\ where the rows and 
columns of this matrix have been permuted according to v. 

We denote by nbdp(G) the graphs that can be obtained by adding an edge to a graph G 6 Qp 
and by nbd~(G) the graphs that are obtained by deleting an edge from G. We call the one-edge- way 
set of graphs nbdp(G) = nbdp(G) U nbdp(G) the neighborhood of G in ^p. These neighborhoods 
connect any two graphs in ffp through a sequence of graphs such that two consecutive graphs in 
this sequence are each others' neighbors. 

Step 1: Resample the graph. We sample a candidate graph G' 6 nbdp (g^'^A from the proposal 

q (G'lGf-^i) = ^Uni (nbd; (g^'^)) + ^Uni (nbd, (g^'^)) , (6) 

where Uni(A) represents the uniform distribution on the discrete set A. The distribution ([6]) gives an 
equal probability of proposing to delete an edge from the current graph and of proposing to add an 
edge to the current graph. We favor (6 1 over the more usual proposal distribution Uni (nbdp (G^'^ 1 1 
that is employed, for example, by Madigan & York ( 1995| ). If G^*^ contains a very large or a very 



small number of edges, the probability of proposing a move that adds or, respectively, deletes an 
edge from G^^^ is extremely small when sampling from Uni (nbdp (G^*^ j), which could lead to poor 
mixing in the resulting Markov chain. 

We assume that the candidate G' sampled from (|6]) is obtained by adding the edge iio,Jo)^ 
z'o < Jo, to G^'K Since G' e nbd^ (g^'A we have G^"^ e nbd" (G'). We consider the decomposition 
of the current precision matrix 



with (^[*])''(^ ' ) 6 M^'-^ ' \ Since the vertex /q has one additional neighbor in G', we have df = df\ 
dl = df + 1, v^' = vf + 1, vj' = vf and v(G') = v(g^'^) U {(/o, Jo)}- We define an upper 



triangular matrix T' such that ¥^ = *P|^'^ for (/, j) e v(G^'^). We sample 7 ~ N (^1^']^, crfj and set 



^''. . = y. The rest of the elements of T' are determined from (*P')^^<^ ' through the completion 



'QJO 



J[S] 



operation. The value of the free element *F^ . was set by perturbing the non-free element ^\ 
The other free elements of *P' and 'V^'^^ coincide. 

We take K' = (Q*)^ [cV'fw) Q*. Since (T')"^^'^ e M''^^'\ we have K' 6 Pc- The dimension- 
ality of the parameter space increases by one as we move from (K^^^, G'^^^) to (K', G'), thus we must 



make use of the reversible jump algorithm of Green ( 1995 1. The Markov chain moves to (K', G') 
with probability min IR^, 1 1 where R^ is given by 



r: = 



X 



p(D\K') p(K'\G\6o,Do) |nbd;(G[-i) 
p (D|K[^]) p (KM|G[-^], c5o, Do) Inbd^ (G') | 

J (K' ^ (wy^G'))^ J (((Y[^])''(G''), r) ^ (^"y(G'))^ 



(7) 



7 /km -^ (^M)v(gW)\ 



<T„ ijln 



exp 



I '0^0 'OJO/ 



20-2 



where |A| denotes the number of elements of the set A. Otherwise the chain stays at (Kt*^,G'^^^). 
Since (T')''(g'^') = (^[^])v(g[»1)^ the Jacobean of the transformation from ((^[•^])''(g'")^ ^J ^ (x^yiC) ^^ 

equal to 1 . Moreover, *pt'^ and *P' have the same diagonal elements, hence det K^'^ = f] \Q*ii^\i) 
det K'. It follows that equation (|7]) becomes 



(=1 






Ig'{So,Do) |nbd;(G')| 



X 



(8) 



xexp 



(k'-K['],U + Do) 



("¥'. . 

\ 'OJO 



hio) 



(Tt 



Next we assume that the candidate G' is obtained by deleting the edge (io,jo) from G^^\ We 



have d?' 



d^'' 


r/G' : 


= d^'' 


- 1, V^' : 


= V^'" 


-1 V^' 


'0 


' JO 


JO 


' '0 


'0 


' JO 



_ VI/[-V] 



vf" andv(G') 

Jo ^ ■' 



'(g[-^])\{Oo,Jo)}. We 



define an upper triangular matrix W such that ¥^ . = ^\'. for (?', j) 6 v(G'). The rest of the elements 
of T' are determined through completion. The free element ¥|*|. becomes non-free in T', hence 
the parameter space decreases by 1 as we move from («p[*])^('^') to (W'Y^^'^ e M'^'-^'K As before, 
we take K' = (Q*)^ ((T')^^') Q*. The acceptance probability of the transition from (Kf'l, Gl'^) to 

(K', G') is min j^^, l) where 



r: 



^ ' ^,o<o^;o7o .oW /g,(5o,Do) |nbd;(G')| 



X 



(9) 



xexp 



(k'-K[-^],U + Do) 



-I- 



\ '070 



hk) 



crt 



We denote by (K.^'^^'^\G^''-% ji[-^+V2] ^ q[s+i]^ ^j^g ^^^^g ^f ^^^ ^l^^^j^ ^^ ^^e end of this step. 

Step 2: Resample the precision matrix. Given the updated graph G^'''^^\ we update the pre- 
cision matrix K^*"^'^^^ = (Q*)^ /»pi+i/2j »p.v+i/2Q* ^^ sequentially perturbing the free elements 

(\p[.v+i/2]j Pqj. g^gj^ gyg]^ element, we perform the corresponding Metropolis-Hastings step 



from Section [Z2] with 6 = n + 60, D = V + Do and Q = Q*. The standard deviation of the normal 
proposals is cr„,. We denote by K^^^'^ e Pgii+ij the precision matrix obtained after all the updates 
have been performed. 

3 Matrix- variate Gaussian graphical models 

We extend our framework to the case when the observed data D = ix'^^K . . . ,x'^"^l are associated 
with a PrX pc random matrix X = (X,y) that follows a matrix-variate normal distribution 

vec (X^) |K«, Kc - Hp,p^ (O, [Kr ® Kd'') 
withp.d.f. dGupta & Nagarj [20001 ): 



p (X|K«, Kc) = iln)-""'''^ (det Kny^'^ (det KcY"'^ exp j-^tr [k«XKcX^]| . (10) 

Here Kr is a p« x pr row precision matrix and Kc is a pc x Pc column precision matrix. Further- 
more, we assume that K^ e Pgr and Kc e Pgc where Gr = (Vp^,Er] and Gc = (Vp^.,Ec) are two 
graphs with pr and pc vertices, respectively. We consider the rows Xi,., . . . , X^^,, and the columns 
Xh.1, . . . , X,pc of the random matrix X. From Theorem 2.3.12 of Gupta & Nagar| ( |2000[ ) we have 

Xl ~ Np^. (0, (k^i).. K-1) and X,^- ~ N^^ fo, (k^^) . . K-^). The graphs Gr and Gc define GGMs for 

the rows and columns of X ( Wang & West[ 2009[ ): 



X;,, IL X,-2. I X(y^,^\j,„-,,), « (Kr);,;^ = (Kr),.^;^ = « {iiJi) t Er, and (11) 

X*;i ii Xh.;2 I ^(Vp^\[jl,J2])* ^ ^^C)jij2 = (Kr);,;, = <^ (jl , J2) € Eq- 

Any prior specification for Kr and Kc must take into account the fact that the two precision ma- 
trices are not uniquely identified from their Kronecker product which means that, for any z > 0, 
(z~^KrJ ® (zKc) = Kr ® Kc represents the same precision matrix for vec(X^) - see equation ( 10). 



We follow the basic idea laid out in Wang & West (2009) and impose the constraint (Kc)i 



11 



Furthermore, we define a prior for Kc through parameter expansion by assuming a G-Wishart prior 
WiSGc(^c> Dc) for the matrix zKc with z> 0,5c > 1 and Dc 6 Pgc- It is immediate to see that the 
Jacobean of the transformation from zKc to (z, Kc) is 

y((zKc)^(z,Kc))=zl^(^^)l-^ 

It follows that our joint prior for (z, Kc) is given by 

;?(z,Kc|Gc,5c,Dc) = — -^— (det Kc)^ exp |-^<Kc,zDc)|z'^^"l^^^^^l-^ 



The elements of K/j e P^^ are not subject to any additional constraints, hence we assume a G- 
Wishart prior W\Sgk{5r,Dr) for Kr. We take 6c = Sr = 3, Dc = Ipc and Dr = I^^. We complete 
ourprior specification by assuming uniform priors p(Gc) = 2/pc(Pc-l) andpCG^) = 2/pr{pr-1) 
for the row and column graphs, where Gr e Qp^ and Gc e Qpc- 

We perform Bayesian inference for matrix- variate GGMs by developing a MCMC algorithm 
for sampling from the joint distribution of the data, row and column precision matrices and row 
and column graphs 

P{£),Kr,z,Kc,Gr,Gc\6c,'Dc,6r,\)r) <x p(D\Kr,Kc)x (12) 

xp {Kr\Gr, 6r, Dr) p (z, KcIGc, 6c, Dc) p (Gc) p (Gr) , 



that is well-defined for K^ 6 P^^, Kc 6 Pgc with (Kc)ii = 1 and z > 0. Equation ( 12) is written 
as: 

pri6r-'2.) I /^ M 1 npr-+6f?-2 npp+6r--2 

P(£),Kr,z,Kc,Gr,Gc\6c,Dc,6r,Dr) cx ^^-MGc)!-! (^et K«)^— (det Kc)^— x (13) 



xexp|--tr 



Y, K«x(^'Kc (x(^'))^ + KrDr + Kc (zDc) 



j=i 



Our sampling scheme is comprised of the following five steps that explain the transition of the 
Markov chain from its current state (K|;\ K[^^ G^^\z^'^) to its next state (K^^''^\K^^''^\G^^^^\z^'^^'^). 
We use four strictly positive precision parameters ct^r, (Tm,c, o-g,R and cr^ c- 



n . .J 



Step 1: Resample the row graph. We denote n\ = npc and U^ = 2 x^^^K^^ (x^^M . We generate 
a random permutation vr e Tp^ of the row indices Vp^ and reorder the row and columns of the ma- 
trix U^ -I- Dr according to vr. We determine the Cholesky decomposition f U^ -i- Dr) = f Q^ j Q^. 



We proceed as described in Step 1 of Section 2.2 Given the notations we used in that section, we 



take p = Pr, n = n^, U = U^, 5o = 5r, Dq = Dr and <Tg = cfg^R. We denote the updated row 
precision matrix and graph by [K.^^''^'^\G^^^^^), K^^^^''-^ e P^u+u. 

Step 2: Resample the row precision matrix. We denote n^ = npc and U^ = 2 x^^^K^!^ (x^^^) . 

We determine the Cholesky decomposition (U^ -I- D^j = (Q^) Q^ after permuting the row and 
columns of U^ -i- Dr according to a random ordering in Tp^. The conditional distribution of 
Kr e P(5i.+i) is G-Wishart WisQ[<+i] {n\ -I- 6r,\}\ -I- DrV We make the transition from Y^^^'^^ to 
j^^s+i] g p^[j+i] using Metropolis-Hastings updates described in Section 



2.1 Given the notations 
we used in that section, we take p = pr, 6 = n\ + 6r,D = \}\ + Dr, Q = Q^ and cr,„ = (T,n,R- 

Step 3: Resample the column graph. We denote n*r = npR and U!. = 2 (x^^^) K^'^'^x*^\ The 

7=1 ^ ' 



relevant conditional distribution is (see equation ([T3])): 

p{Kf,G'^,z'^\D,Kf'\6c,lic) - ^ J p^ (zt-jf^^\detKg^)^x (14) 

xexp|-i{KS!\U^+z['^]Dc)| 



2 
We sample a candidate column graph G^ e nbd^^ ip^cj fro^i the proposal 



where 1^ is equal to 1 if A is true and is otherwise. The proposal ( 15 1 gives an equal probability 
that the candidate graph is obtained by adding or deleting an edge from the current graph - see 
also equation ([6]). 

We assume that G^ is obtained by adding the edge (io,jo) to G^^\ We generate a random 
permutation vc 6 'fpc^^ of the row indices Vp^. and reorder the row and columns of the matrix 
U^ + z'^^Dc according to vc- The permutation vc is such that i/c(l) = 1, hence the (1, 1) element 

of K[^^ remains in the same position. We determine the Cholesky decomposition (U^ + z'*^Dc) = 
(Qc) Qc °f \^*c "•" ^^'^I^cj • We consider the decomposition of the column precision matrix 

with ("f^c)^^ ^ mH^c'). We define an upper triangular matrix "V'^ such that (W'^),, = (f^c^).. for 
(/,j) 6 v(g[^^). Wesampley ~ N((t[^^). . ,cr^^| and set (t^). . = y. The rest of the elements of 

T^ are determined from 1^^) through the completion operation. We consider the candidate 
column precision matrix 

K^ = (Qc)'(nf ^PcQc- (16) 

We know that K[!^ 6 P^[,] must satisfy (k^^^) = 1. The last equality implies (tJ^^) = 1/ (Q^) , 
hence (w^) = 1 / (Q^) . Therefore we have K;. e Pg^. and (k;.) = 1 . 
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We make the transition from (k[!\ G^^) to (k^, G'^) with probability min {i?J, ll where 

:,cV2;^(Qc)..(Qc)..('^?),,.7^(5^:d^- 



^C - CTg,, 



X exp < - 



1 



{k'c-K]^\Wc+z^'^Bc)- 






(17) 



'IQJO 



o- 



g,c 



U] 



Next we assume that G^ is obtained by deleting the edge (/q, Jo) from Gj^ . We define an upper 
triangular matrix T^ such that (^c) ~ (*^cj ^°^ ^''^^ ^ ^(^c)' "^'^^ candidate K^ is obtained 



from *P^ as in equation ( 16 1. We make the transition from (k[^\ G^^A to (K^, G^ 1 with probability 
minji?^, ll where 



R-, = Uc^{Q*,),,{Ql)..(^^X^^ 



-1 IcM^c'Oc) G"enbd-JG^^ 



c^^^'^pcy^c , 



/g'(<5c,Dc) 



G^enbd+ (G^) 



(zi^'i)' 



.]V<^c)l 



:;vx (18) 



xexp 



{k'c-K^^,V*c+z^'^Dc) + 



"-/(oyo \ "- '10 Jo 



2l 



CT 



8,C 



We denote the updated column precision matrix and graph by (k|^"^^ ^^, G[!^^^j. 

Step 4: Resample the column precision matrix. We denote n* = npR and Ut = 2 (x^^M Kj^^^^^x*^^ 

We determine the Cholesky decomposition (U^ + z'^^^Dc) = (Qc) Qc ^^^^^^ permuting the row 
and columns of U^ + z'^^^Dc according to a random ordering in Tp^ \ The conditional distribution 
of Kc 6 Pgi'+n with (Kc)ii = 1 is G-Wishart Wis^^u+u (n^ + 5c, U^ + z^'^Dc). We make the tran- 
sition from K[;'^ to K[!'^'^ 6 PqU+i] using Metropolis-Hastings updates from Section 2.1 Given 
the notations we used in that section, we take p = pc, 6 = n*^ + 6c, T> = IJ*^ + z^'^^T>c, Q = Q^ and 



Cm = o-m,c- The constraint (Kc)ii = 1 is accommodated as described at the end of Section 2.1 



Step 5: Resample the auxiliary variable. The conditional distribution of z > is 

Gamma(^^^%^ + |v(Gr") |, ^tr(4-^^Dc)l 



(19) 



Here Gamma(Q',jS) has density f{x\a,P) oc [i"yf ^ exp(-jSx). We sample from ( 19 1 to obtain z^^*'^^ 
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4 Sparse models for lattice data 



Conditional autoregressive (CAR) models ( Besag[ 1974, Mardia 1988) are routinely used in spa- 
tial statistics to model lattice data. In the case of a single observed outcome in each region, the 

data is associated with a vector X^ = (Xi, . . . ,XpA where X, corresponds to region i. The zero- 
centered CAR model is implicitly defined by the set of full conditional distributions 



Xi\{X,:i'i^i}-^H\ybu'X,,A^ 



i'*i 



i= 1,...,Pr. 



Therefore, CAR models are just two-dimensional Gaussian Markov random fields. According to 



Brook's ( ]1964[ ) theorem, this set of full-conditional distributions implies that the joint distribution 
for X satisfies 

p{X\A) oc exp |-^X^A-\I - B)X 

where A = diag{/lj, . . . , A^J and Bis a prX pr matrix such that B = (bu') and bu = 0. In order for 
A~^(I - B) to be a symmetric matrix we require that bwAj = b^A^ for / i^ i'; therefore the matrix B 
and vector A must be carefully chosen. A popular approach is to begin by constructing a symmetric 
proximity matrix W = (w,,/), and then set Z?„' = w„v/w,+ and A] = T^/Wi+ where w,+ = 2r' ^h' ^^d 
T^ > 0. In that case, A"\l - B) = t~^(Ew - W), where Ew = diag{wi+, . . . , Wp^+}. The proximity 
matrix W is often constructed by first specifying a neighborhood structure for the geographical 
areas under study; for example, when modeling state or county level data it is often assumed that 
two geographical units are neighbors if they share a common border. Given that neighborhood 
structure, the proximity matrix is specified as 



Wii' = 



1 1, if /' G di, 
1 0, otherwise. 



(20) 



where di denotes the set of neighbors of /. 

Specifying the joint precision matrix for X using the proximity matrix derived from the neigh- 
borhood structure is very natural; essentially, it implies that observations collected on regions that 
are not neighbors are conditionally independent from each other given the rest. However, note that 
the specification in (|20]) implies that (Ew - W)lp^ = and therefore the joint distribution on X is 
improper. Proper CAR models |Cressie| ( fT973| ) ; |Sun et al.| ( |2000l ) ; [Gelfand & Vounatsou] ( [2003] ) can 
be obtained by including a spatial autocorrelation parameter p, so that 



I \— 1 W:i> T 

XMXr:i'^i}-mpy^X,,- 



l'*l 



Wh 



(21) 



The joint distribution on X is then multivariate normal N^^ (0, D^M where Dw = r^ (Ew -pW)~^ 
This distribution is proper as long as p is between the reciprocals of the minimum and maximum 
eigenvalues for W. In particular, note that taking p = leads to independent random effects. 
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In the spirit of |Besag & Kooperberg| ( [T995| ), an alternative but related approach to the con- 
struction of models for lattice data is to let X ~ Np^(0, K"') and assign K a G-Wishart prior 
W\Sg„ {S, (6 - 2)Dw) where the graph Gw is implied by the neighborhood matrix W defined in 
(|20j). Following | Lenkoski & Dobra| ( |2010[ ), the mode of WIsg^, (S, (6 - 2)Dw) is the unique posi- 
tive definite matrix K that satisfies the relations 



K-i} = (Dw)ii' , if i' e di, and K,-,' = 0, if i' t di. 



(22) 



The matrix D J verifies the system ( 22 1, hence it is the mode of WiSc^ iS, {6 - 2)Diy). As such, the 



mode of the prior for K induces the same prior specification for X as (21 ) 
It is easy to see that, conditional on K e Pgw^ we have 






(23) 



Hence, by modeling X using a Gaussian graphical model and restricting the precision matrix K to 
belong to the cone Pgw^ we are inducing a mixture of CAR priors on X where the priors on 



bw 



-Kii'/Kji, if/' 6 di. 



\0, 



otherwise. 



and Aj = l/Ku are induced by the G-Wishart prior WISg^, (S, (6 - 2)Bw). 

The specification of CAR models through G-Wishart priors solves the impropriety problem and 
preserves the computational advantages derived from standard CAR specifications while provid- 
ing greater flexibility. Indeed, the prior is trivially proper because the matrix K e Pg„ is invertible 
by construction. The computational advantages are preserved because the full conditional distri- 
butions for each X, can be easily computed for any matrix K without the need to perform matrix 
inversion, and they depend only on a small subset of neighbors {Xj : j e di}. Additional flexibility 
is provided because the weights Z?,,' for i' 6 di and smoothing parameters Af are being estimated 
from the data rather than being assumed fixed, allowing for adaptive spatial smoothing. Indeed, 
our approach provides what can be considered as a nonparametric alternative to the parametric 



estimates of the proximity matrix proposed by Cressie & Chan ( 1989 ) 



A similar approach can be used to construct proper multivariate conditional autoregressive 
(MCAR) models ( |Mardia[ |1988| [Gelfand & Vounatsouj |2003[ ). In this case, we are interested in 
modeling a pr x pc matrix X = (X,y) where Xij denotes the value of the 7-th outcome in region 
i. We let X follow a matrix-variate normal distribution with row precision matrix Kr capturing 
the spatial structure in the data (which, as in univariate CAR models, is restricted to the cone 
Pg„ defined by the neighborhood matrix W), and column precision matrix Kc, which controls the 
multivariate dependencies across outcomes. It can be easily shown that the row vector X,* of X 
depends only on the row vectors associated with those regions that are neighbors with i — see also 
equation ( [TT] ): 






r + i\ 



N 



PC 



,„,(K»)„'''- 



1 



(Kfi), 



k; 
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This formulation for spatial models can also be used as part of more complex hierarchical models. 
Indeed, CAR and MCAR models are most often used as a prior for the random effects of a general- 
ized linear model (GLM) to account for residual spatial structure not accounted for by covariates. 
When no covariates are available, the model can be interpreted as a spatial smoother where the 
spatial covariance matrix controls the level of spatial smoothing in the underlying (latent) surface. 
Similarly, MCAR models can be used to construct sparse multivariate spatial GLMs. 

As an example, consider the pn x pc matrix Y = (7,^1 of discrete or continuous outcomes, and 
let Yij ~ hj{-\riij) where hj is a probability mass or probability density function that belongs to the 
exponential family with location parameter rjij. The spatial GLM is then defined through the linear 
predictor 

g-\r]ij)=tij + Xij + Zijfij, 

where g is the link function, jUj is an outcome- specific intercept, Xjj is a zero-centered spatial 
random effect associated with location z, Z,y is a matrix of observed covariates for outcome j at 
location z, and /3j is the vector of fixed effects associated with outcome j. We could further take 
yij ~ Poi(?7,y), g~^(-) = log(-) and assign a sparse matrix-variate normal distribution for X. This 
choice leads to a sparse multivariate spatial log-linear model for count data, which is often used 



for disease mapping in epidemiology (see Section p3| ). 

Due to the structure associated with our centering matrix Dr, and the presence of the expanded 
scaling parameter z, prior elicitation for this class of spatial models is relatively straightforward. 
Indeed, elicitation of the matrix D^ requires only the elicitation of the neighborhood matrix W, 
along with reasonable values for the spatial autocorrelation parameter p and scale t^. In particular, 
in the applications we discuss below we assume that, a priori, there is a strong degree of positive 
spatial association, and set p = 0.99. Also, and as is customary in the hterature on spatial models 
for lattice data, r^ is selected to roughly reflect the scale of data being modeled. In the case 
of MCAR models, it is common to assume that the prior value for the conditional covariance 
between variables is zero, which leads to choosing a diagonal Dc. When the scale of all the 
outcome variables is the similar (as in the applications below), it is then sensible to pick Dc = Ipc, 
as the actual scale is irrelevant. 



5 Illustrations 
5.1 Simulation study 

We consider a simulation study involving matrix-variate normal graphical models where pr = 5 
and Pc = 10. We sample 100 observations of matrices with row graph Gr defined by the edges 
{(l,z) : 1 < z < 5} U {(2, 3), (3, 4), (4, 5), (2, 5)} and column graph do. Here Cio denotes the 
10-dimensional cycle graph with edges {(z, z -I- 1) : 1 < z < 9} U {(1, 10)}. We note that both the 
row and column graphs are non-decomposable, with the row graph being relatively dense, while 
column graph is relatively sparse. In this case Kr 6 Pc^ is set so that (Kr),,- = 1, (Kr);^- = 0.4 
for {i,j) e Gr. Kc 6 Pco is defined in a similar manner. To generate the 100 samples from this 
matrix-variate distribution, we sample vec(X^) ~ Np^^^ (0, [Kr ® Kc]~^) and rearrange this vector 
to obtain the appropriate matrices. 
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Table 1: Results of simulation study in estimates for the row graph and Kr. In the matrix below, 
the values along the diagonal and in the upper triangle correspond to the average of the estimates 
of entries in K across the 100 repetitions, while the values below the diagonal correspond to the 
average of the edge probabilities across the 100 iterations. 








2 


3 


4 


5 


1 


1.107 


0.442 


0.438 


0.436 


0.44 


2 




1.105 


0.439 





0.441 


3 




1 


1.1 


0.437 





4 




0.026 


1 


1.086 


0.438 


5 




1 


0.04 


1 


1.096 



We ran the matrix- variate GGM search described in Section |3] for 50,000 iterations after a 
bum in of 5,000 iterations. We set cr,„R, cr,„c, (Tg^R and cTgc all to the value .5, which achieved 
an average rejection rate on updates of elements for both Kc and Kr of about 30 percent. We 
repeated the study 100 times to reduce sampling variability. Tables [1] and |2] show the resulting 
estimates for the row and column GGMs respectively. We see that the matrix-variate GGM search 
outlined above performs extremely well at recovering the structure of both the column and row 
graphs as well as the values of both Kc and Kr. Since both row and column GGMs are defined 
by non-decomposable graphs, we have illustrated that our methodology is capable of searching the 
entire graph space of matrix-variate GGMs. 



5.2 Modeling statewide SAT scores 

This section develops a model for the scores on the SAT college entrance exam for the year 1999. 
The data consists of statewide average verbal and mathematics SAT scores, along with the per- 
centage of eligible students taking the test, for each of the 48 contiguous states plus the District of 
Columbia. The verbal scores were originally analyzed by |Wall| ( 2004) , who used them to illustrate 
the differences among different models for lattice data. Figure [T] shows choropleth maps of the 
two scores; in both cases there is strong evidence of spatial association (note the clear clusters of 
midwestern, central, western and southeastern states). 

In this section we develop a bivariate spatial regression where both verbal and mathematics 
scores are treated as response variables, while the percentage of eligible students are treated as the 
explanatory variable. Figure |2] presents pairwise scatterplots of all three variables. They demon- 
strate that both verbal and mathematics scores are positively associated with each other, and neg- 
atively associated with the percentage of eligible students who take the test. This association 
between SAT scores and the percentage of eligible students could be explained by selection bias: 
in states where fewer students take the exam, it is typically only the best qualified who do, which in 
turn results in better averages than those from states who provide a wider access to testing. Hence, 
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Table 2: Results of simulation study for estimates of the column graph and Kc. In the matrix 
below, the values along the diagonal and above it to the average of the estimates of entries in Kc 
across the 100 repetitions, while the values below the diagonal correspond to the average of the 
edge probabilities across the 100 iterations. 





1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


1 


1 


0.395 


-0.001 


0.001 





-0.001 





0.001 


-0.001 


0.396 


2 


1 


0.941 


0.364 





0.001 








0.001 


0.002 


0.001 


3 


0.024 


1 


0.916 


0.365 





0.001 


-0.001 











4 


0.046 


0.032 


1 


0.932 


0.37 


-0.001 








0.001 


-0.001 


5 


0.022 


0.046 


0.033 


1 


0.936 


0.371 


-0.001 





-0.001 


0.001 


6 


0.04 


0.038 


0.036 


0.034 


1 


0.927 


0.362 


-0.001 





-0.001 


7 


0.037 


0.03 


0.052 


0.025 


0.042 


1 


0.928 


0.366 








8 


0.047 


0.046 


0.024 


0.053 


0.032 


0.049 


1 


0.935 


0.377 





9 


0.072 


0.082 


0.039 


0.029 


0.036 


0.068 


0.039 


1 


0.93 


0.364 


10 


1 


0.057 


0.037 


0.03 


0.047 


0.044 


0.035 


0.066 


1 


0.946 



Figure 1: SAT scores in the 48 contiguous states. Left panel shows verbal scores, while the right 
panel shows mathematics scores. 




(a) Verbal 



(b) Mathematics 



our model attempts to jointly understand the behavior of the scores adjusting for this selection bias. 
More specifically, we let Yij be the score on test j (j = 1 for verbal and j = 2 for math) and 
state / = 1, . . . , 49. We use the notation from Section ^and take pr = 49, pc = 2. The observed 
scores form a prX pc matrix Y = [Yijj. Since Figure 
scores and the percentage of eligible students, we let 



suggests a quadratic relationship between 



Y.=/3oj+fi,jZ,+fi2jZf+X,^ 
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where Z, represents the percentage of ehgible students taking the exam and Xjj is an error term. 
The vectors )S^ = \B()j,j3ij,/32j) , j = 1,2, are given independent normal priors Nsfby, QjM with 
by = (550,0,0)^ and Hj' = diag{225, 10, 10}. The residual matrix X = (Xjj) is given a matrix 
variate GGM 

vec(x^)|K«,Kc - U,^p,{0,[K„®Kcr'), 

(zKc)\6c,T>c - WiSGc(5c,Dc). 

where z > is an auxiliary variable needed to impose the identifiability constraint (Kc)ii = 1. 
The graph Gr, which captures the spatial dependencies among states, is fixed to the one induced 
by the neighborhood matrix W = (w„'), where w,y = 1 if and only if states i and i' share a border, 
while we set a uniform prior on Gc- The degrees of freedom for the G-Wishart priors are set as 
Sc = 3r = 3, while the centering matrices are chosen as Dc = I2 and, in the spirit of CAR models, 
Br = (6r - 2)wi+(Ew - 0.99W)~^ It follows that 

vec (y^) \Kr, Kc - Up,,,, (vec ((Zfif) , [K« KcV') , (24) 

where Z is a p^ x 3 matrix whose z-th row corresponds to the vector (1,Z,, Zi^) and)8 is a 3 x 2 
matrix whose j-th column is fij. 

A MCMC algorithm for sampling from the posterior distribution of this model's parameters is 
described in Appendix [A} This algorithm was run for 20000 iterations after a burn in of 1000 iter- 
ations. To assess convergence, we ran 10 instances of the algorithm at these settings, each initiated 
at different starting points and with different initial random seeds. Figure [8] shows the mean value, 
by iteration of the parameter z for each chain, which display significant agreement. 

Table [3] shows the posterior estimates for the regression coefficients ySg, ;Sj and)S2, along with 



the values reported by Wall (2004) for the verbal scores using two different models for the spatial 
structure: 1) a conventional CAR model, 2) an isotropic exponential variogram where the centroid 
of each state is used to calculate distances between states (see |Wallj 12004, for details). Note that 



the posterior means obtained from our model roughly agree with those produced by the more con- 
ventional approaches to spatial modeling. However, the Bayesian standard errors obtained from 
our model are consistently larger than the standard errors reported in [Wall (2004). We believe that 



this can be attributed to the richer spatial structure in our model, with the uncertainty in the spatial 
structure manifesting itself as additional uncertainty in other model parameters. 

As expected, the data suggest a positive association between SAT scores. The posterior proba- 
bility of an edge between verbal and math scores is 80%, suggesting moderate evidence of associ- 
ation between scores, while the posterior mean for the correlation between scores (conditional on 
it being different from zero) is 0.35 with a 95% credible interval of (0.09, 0.58). 

Figure [3] shows the fitted values of math and verbal scores by state. In comparison to Fig- 
ure [TJ we see considerable shrinkage of the fitted scores, particularly for the mathematics SAT. We 
note that, even after accounting for the percentage of students taking the exam, substantial spatial 
structure remains; generally speaking, central states (except Texas) tend to have higher SAT scores 
than coastal states on either side. This is confirmed by the results in Figure |4} which compares 
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the spatial interaction structure implied by the prior parameter D^^ to the structure implied by the 
posterior mean of Kr. Since both matrices share the same underlying conditional independence 
graph, both matrices have the same pattern of zeroes. If no spatial structure were present in the 
data we would expect the non-zero entries to have shrunk towards zero. However the posterior for 
K/j shows considerably larger variation in the spatial relationship structure than the prior. Indeed, 
while the prior imposes negative values for the off-diagonal of D^^ (implying a positive correla- 
tion), the posterior of K« has several positive values, the largest of which is the entry between 
Colorado and New Mexico, implying a negative correlation between deviations in the SAT scores 
of these two states. More generally, note that the posterior mean for Kr is very different from its 
prior mean, which suggest that the model is able to learn from the data about the spatial correlation 
structure. 



Figure 2: Scatterplots for the SAT data set. Panel (a) shows the average verbal SAT score in 
the state vs. the number of eligible students that take the test, panel (b) shows the average math 
SAT score against the number of eligible students, and panel (c) shows the average verbal vs. the 
average math scores. 
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5.3 Mapping cancer mortality in the U.S. 

Accurate and timely counts of cancer mortality are very useful in the cancer surveillance commu- 
nity for purposes of efficient resource allocation and planning. Estimation of current and future 
cancer mortality broken down by geographic area (state) and tumor have been discussed in a num- 
ber of recent articles, including |Tiwari et"aL] ( |2004| ), [Ghosh & Tiwari| ( [2007| ), [Ghosh et"aL] ( |2007| ) 
and [Ghosh et al.| ( p008| ). This section considers a multivariate spatial model on state-level cancer 
mortality rates in the United States for 2000. These mortality data are based on death certificates 
that are filed by certifying physicians and is collected and maintained by the National Center for 
Health Statistics (http : //www. cdc . gov/nchs) as part of the National Vital Statistics Sys- 
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Table 3: Posterior Estimates for the regression parameters in the SAT data. The columns labeled 
GGM-MCAR correspond to the estimates obtained from the bivariate CAR model described in 



Section 5.2 while the column labeled Wall partially reproduces the results from Wall (2004) for 



the verbal SAT scores. 





GGM-MCAR 


Wall (Verbal) 




Math 


Verbal 


CAR 


Variogram 


Intercept {fioj) 


580.12 


583.26 


584.63 


583.64 




(17.3) 


(15.91) 


(4.86) 


(6.49) 


Percent Eligible (fiij) 


-2.147 


-2.352 


-2.48 


-2.19 




(0.394) 


(0.498) 


(0.32) 


(0.31) 


Percent Eligible Squared OS2;) 


0.0154 


0.0171 


0.0189 


0.0146 




(0.0055) 


(0.0069) 


(0.004) 


(0.004) 



Figure 3: Fitted SAT scores in the 48 contiguous states. Left panel shows verbal scores, while the 
right panel shows mathematics scores. The scale is the same as in Figure[T] 




[479,502] 
(502,525] 
(525.548] 
(548,571] 
(571,594] 



(a) Verbal 



(b) Mathematics 



tem. The data is available from the Surveillance, Epidemiology and End Results (SEER) program 
of the National Cancer Institute (http : //seer, cancer . gov/seerstat ). 

The data we analyze consists of mortality counts for 1 1 types of tumors recorded on the 50 
states plus the District of Columbia. Along with mortality counts, we obtained population counts 
in order to model death risk. Figure[5]shows raw mortality rates for four of the most common types 
of tumors (colon, lung, breast and prostate). Although the pattern is slightly different for each of 
these cancers, a number of striking similarities are present; for example, Colorado appears to be a 
state with low mortality for all of these common cancers, while West Virginia, Pennsylvania and 
Florida present relatively high mortality rates across the board. 

A sparse Poisson multivariate loglinear model with spatial random effects was fitted to this 
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Figure 4: A comparison of D^ and the posterior mean of K^ in the SAT example. For com- 
parison, the heat maps have the same scale, showing the considerable added variability in spatial 
dependence imposed by the model. 



a 9 1 

» "It 

■ E- - 

n B 

s 



a 

g 



MM » 

R 

B 
■ S 



I ■ ■ 

■ B 

a a 



m m 



» « n 

■I 
a Bar 

It 

a 

a 



1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 r 




<<3<oSfJQ'^5^-~--S£35§3^i5i22ii2i^^iooo^'^(nM?^^^&Si55 
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(b) Posterior 



data. The model is constructed along the lines described in Section m In particular, we let Yjj be 
the number of deaths in state i = I, ... ,pr = 51 for tumor type j = I, ... ,pc = 25. We set 



Yij\r]ij ~ Po\ (r]ij), 

log {rji/j = log (m,) + yUy + Xij, 



(25) 



Here m, is the population of state i, pj is the intercept for tumor type j, and Xij is a zero-mean 
spatial random effect associated with location i and tumor j. We follow the notation from Section 
3 and denote X,y = /j.j + Xij. We further model X = (XjA with a matrix-variate Gaussian graphical 
model prior: 



vec(X^)|//,K«,Kc - Up,,,lvcci^(l,yfY[Ki,®Kcr'y 
(zKc)|5c,Dc ~ WiSGc(5c,Dc). 



(26) 



Here 1/ is the column vector of length / with all elements equal to 1, ^ = (/ii, . . . ,fJ.pcV and z > 
is an auxiliary variable needed to impose the identifiability constraint (Kc)n = 1. The row graph 
Gr is again fixed and derived from the incidence matrix W derived from neighborhood structure 
across U.S. states, while the column graph Gc is given a uniform distribution. The degrees of 
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(c) Colon 



(d) Lung 




(e) Breast 



(f) Prostate 



Figure 5: Mortality rates (per 10,000 habitants) in the 48 contiguous states corresponding to four 
common cancers during 2000. 



freedom for the G-Wishart priors are set as 5r = 5c = 3, while the centering matrices are chosen as 
Dc = Ipc and D/j = {6r - 2)wi+(Ew - 0.99W)"^ We complete the model specification by choosing 
a uniform prior for the column graph p (Gc) o^ 1 and a multivariate normal prior for the mean rates 
vector// ~ Np^ (fi^, H^M where /Iq = /iolpc ^^'^ ^ = oj"^!,,^. We set/zo to the median log incidence 
rate across all cancers and regions, and co to be twice the interquartile range in raw log incidence 
rates. 

Posterior inferences on this model are obtained by extending the sampling algorithm described 
in Section [3| details are presented in Appendix |Bj Since mortality counts below 25 are often 
deemed unreliable in the cancer surveillance community, we treated them as missing and sampled 
them as part of our algorithm. Inferences are based on 50000 samples obtained after a discarding 
the first 5000 samples. We ran ten separate chains to monitor convergence. Figure [8] shows the 
mean value of z by log iteration for each of the ten chains. All ten agree by 50000 iterations. 

Figure [6] shows the graphical model constructed by adding an edge between any two variables 
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Figure 6: Graphical model from the cancer example. The eleven cancers are: brain (BN), breast 
(BR), colon (CO), kidney (KD), larynx (LA), lung (LU), oral (OR), ovarian (OV), prostrate (PR), 
skin (SK), and rectum (RE) 




if this edge was present in more than 50% of the total repetitions. This figure shows several inter- 
esting patterns. First, we see that lung cancer (LU) and colon cancer (CO) have the highest degree 
of connectivity. This is a useful result, since lung and colon cancer are the two most prevalent 
cancers in our dataset. This implies that variability in these cancers, after controlling for spatial 
effects, is a good predictor of variability in other lower count cancers. We also see a clique formed 
between lung, larynx (LA) and oral (OR) cancers, which is not surprising, as many cases of these 
cancers are believed to be related to chronic smoking. Another connexion we expected to see cor- 
responds to the edge between breast (BR) and ovarian (OV) cancer. Indeed, both of these cancers 
have been related to mutations in the BRAC genes, and it therefore is reasonable that increased 
levels of one would be correlated to an increase in the other. We find that Brain (BN) cancer rates 
are not correlated with the rates of any other cancer. Finally, Figure |7] shows the average fitted 
values for the cancer mortality by state for the four cancers shown in Figure [5} In a result similar to 
those show in the SAT example above, we see some shrinkage towards the mean, but a preservation 
of the spatial variability present in the underlying data. 

Table |4] shows the distribution of the imputed counts for those cancer/state combinations where 
the reported count was below 25. 
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Table 4: Distribution of imputed counts for cancers with fewer than 25 reported counts in a given 
state 



State 


Cancer 


Median 


95% Interval 


District of Columbia 


Brain 


13 


(3 , 24) 


North Dakota 


Brain 


19 


(9 , 25) 


District of Columbia 


Rectum 


19 


(11,24) 


North Dakota 


Rectum 


20 


(13 , 25) 


Wyoming 


Rectum 


15 


(9 , 22) 


Delaware 


Skin 


20 


(12 , 25) 


District of Columbia 


Skin 


15 


(7 , 23) 


North Dakota 


Skin 


17 


(10 , 23) 


South Dakota 


Skin 


20 


(13 , 25) 


Vermont 


Skin 


17 


(10 , 24) 


Wyoming 


Skin 


13 


(8 , 20) 


District of Columbia 


Kidney 


21 


(14 , 25) 


Wyoming 


Kidney 


20 


(13 , 25) 


Delaware 


Oral 


19 


(12 , 25) 


North Dakota 


Oral 


15 


(9 , 22) 


Rhode Island 


Oral 


22 


(16 , 25) 


South Dakota 


Oral 


18 


(11,24) 


Vermont 


Oral 


16 


(10,23) 


Wyoming 


Oral 


11 


(6,18) 


Delaware 


Larynx 


12 


(6 , 19) 


District of Columbia 


Larynx 


11 


(5 , 20) 


Idaho 


Larynx 


12 


(6,18) 


Maine 


Larynx 


20 


(11,25) 


Montana 


Larynx 


10 


(4,16) 


Nebraska 


Larynx 


18 


(11,25) 


New Hampshire 


Larynx 


17 


(11,24) 


New Mexico 


Larynx 


13 


(7,21) 


North dakota 


Larynx 


6 


(2,11) 


Rhode Island 


Larynx 


16 


(9,23) 


South Dakota 


Larynx 


8 


(4 , 14) 


Utah 


Larynx 


8 


(3 , 15) 


Vermont 


Larynx 


9 


(4 , 15) 


Wyoming 


Larynx 


5 


(2,9) 
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[0.95,1.27] 
(1.27, """ 
(1.59,1.91 
(1.91,2.23] 
(2.23,2.55] 




(1.83.3.22] 

(3.22,4.6] 

(4.6,5. 

(5.99,7.37] 

(7.37.8.76] 



(a) Colon 



(b) Lung 





(1.38,1.6] 
(1.6,1.82] 



(c) Breast 



(d) Prostate 



Figure 7: Fitted Mortality rates (per 10,000 habitants) in the 48 contiguous states corresponding to 
four common cancers during 2000. 
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Figure 8: Convergence plots for the SAT (left) and cancer examples (right). These plots show the 
mean of z, by log iteration for each of ten chains. 





Log Iteration 



Log Iteration 



6 Discussion 



This paper has developed and illustrated a surprisingly powerful computational approach for multi- 
variate and matrix- variate GGMs, and we believe that our application to the construction of spatial 
models for lattice data makes for particularly appealing illustrations. On one hand, the exam- 
ples we have considered suggest that the additional flexibility in the spatial correlation structure 
provided by out approach is necessary to accurately model some spatial data sets. Indeed, our 
approach allows for "non- stationary" CAR models, where the spatial autocorrelation and spatial 
smoothing parameters vary spatially. On the other hand, to be best of our knowledge, we are un- 
aware of any approach in the literature to construct and estimate sparse MCAR, particularly under 
a Bayesian approach. In addition to providing insights into the mechanisms underlying the data 



generation process, it has been repeatedly argued in the literature (see, for example, CarvaUio & 
West[ |20d7l and [Rodriguez et al.[|2010[ ) that sparse models tend to provide improved predictions, 
which is particularly important in such a levant application as cancer mortality surveillance. 

The proposed algorithms work well in the examples discussed here where the number of ge- 
ographical areas is moderate (that is, around fifty or less). Convergence seems to be achieved 
quickly (usually within the first 5000 iterations) and running times, although longer than for con- 
ventional MCAR models, are still short enough as to make routine implementation feasible (about 
1 hour for the simpler SAT dataset, and about 1 .5 days for the more complex cancer dataset, both 
running on a dual-core 2.8 Ghz computer running Linux). However, computation can still be very 
challenging when the number of units in the lattice is very high. In the future we plan to ex- 
plore parallel implementations (both on regular clusters and GPU cards), and to exploit the sparse 
structure of the variance-covariance matrix to reduce computational times. 
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A Details of the MCMC algorithm for the sparse multivariate 
spatial Gaussian model for SAT scores 



The Markov chain Monte Carlo sampler for the spatial regression model in Section 5.2 involves 
iterative updating of model parameters from six full conditional distributions. 

Step 1: Resample the regression coefficients fi^ = (jSoi,/3ii,;02i)- We denote by Y^y the j-th 
column of the p^ x pc matrix Y. From Theorem 2.3.11 of Gupta & Nagar| ( 2000| ) applied to 
equation (|24|) we have 



Y.ilY 



*2 



N 



PR 



Zfi, + (Y.2 - ZySj) 






[(Kc)„K»l 



-1 



Given the prior ^2 ~ N3 (b2, ^2) ^^^ the constraint (Kc)ii = 1, it follows that ySj is updated by 
sampling from the multivariate normal N3 \mp^,Q.n^] where 



ill,,=Z' K,Z + flz, nifc = "ft Z' Ks 



Y.i - (Y.2 - Z^j) 



{^-c% 



+ n7b 



2"2 



Step 2: Resample the regression coefficients ySj = (fioi^Pn^Pii)- Since aprioriyS^ ~ Nsmi,^/], 
it follows thatySj is updated by sampling from the multivariate normal N3 fmygj, HoM where 



^fii - (Kc)22 Z K^Z + Hi, niygj - n^^ 



(Kc)22 Z K/; 



Y.,-(Y., -Z,8,) 



(Kc-), 



+ aibi 



Step 3: Resample the row precision matrix K^ encoding the spatial structure. 

Step 4: Resample the column graph Gc- 

Step 5: Resample the column precision matrix Kc. 

Step 6: Resample the auxiliary variable z. 

After updating the regression coefficients in Steps 1 and 2, we recalculate the residual matrix 
X = [Xij], where X,y = Yjj - Pqj - P\jZi -jS2yZ^. The updated X represents a matrix variate dataset 
with a sample size n = I, based on which Kr, Gc, Kc and z are updated. Thus Steps 3, 4, 5 
and 6 above correspond to Steps 2, 3, 4 and 5 from Section [3| and no new algorithm needs to be 
described. Note that Step 1 of Section [3] is unnecessary as the row graph Gr is assumed known, 
and given by the neighborhood structure. 



B Details of the MCMC algorithm for the sparse multivariate 
spatial Poisson count model 

The Markov chain Monte Carlo sampler for the sparse multivariate spatial Poisson count model 



from Section 5.3 involves iterative updating of model parameters in six steps. 
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Step 1: Resample the mean rates fi. The full conditional for /i is 



1 



PR 



p (fi\X, K«, Kc) ex exp I - -tr [k« (X - Ip^) Kc (X - l^y) +lp,(p- tx^f ^(P't^o) 1 

It follows that fi is updated by direct sampling from the multivariate normal N^^ (m^, K„M where 
K^ = (iIKrIp,) Kc + PRn, m^ = K^^ [KcX^K^I^, + 7?«n;io] • 

Step 2: Resample X. We update the centered random effects X by sequentially updating each row 
vector Xit, i = I, . . . ,Pr. The matrix- variate GG M prior from equation (26 1 implies the following 
conditional prior for X;* ([Gupta & Nagar[ |2000 ): 



where 



Xfj{x,, :/'^/),Ai,K«,Kc - N^,(M,,a7i), 



(27) 



i'edi 



(K«), 



It follows that the full conditional distribution of X,,. is written as 

p (X,>| {X,vh. : i' 4^ i\ , Y,-,,/i, Kr, Kc) oc 



PC _ _ 

Y\ exp jl',7 (log(mO + X,-^) - m, exp (X;^)) 



;=i 



exp 



~ \T 



j[x,,-(M/]n,.[(x,,) 



M, 



(28) 



Since the distribution ( 28 ) does not correspond to any standard distribution, we design a Metropolis- 
Hastings algorithm to sample from it. We consider a strictly positive precision parameter a. For 
each 7 = 1, . . . , pc, we update the j-th element of X;* by sampling y ~ N (x,y, a^y We define a 

candidate row vector X"^''" by replacing X,y with y in X,,. . We update the current «-th row of X with 
X;f '' with probability 

[ p (X«-1 {X,, : /' ^ i\ , Y,-.,ti, Kr, Kc) ^ 

minn, ^ 

[ p (X„ I (X,v, : i' ^ /) , Yi,,n, Kr, Kc) 

Otherwise the ?-th row of X remains unchanged. 

Step 3: Resample the row precision matrix K^ encoding the spatial structure. 

Step 4: Resample the column graph Gc- 

Step 5: Resample the column precision matrix Kc. 

Step 6: Resample the auxiliary variable z. _ 

After completing Steps 1 and 2, we recalculate the zero-mean spatial random effects Xjj = X,y -fij. 

The updated X = (Xij) represents a matrix variate dataset with a sample size n = 1, based on which 

Kr, Gc, Kc and z are updated. Steps 3, 4, 5 and 6 above are Steps 2, 3, 4 and 5 from Section|3j 
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